Real-Time Mobile Video Communication with Low Power Terminals 

1 Background of the invention 

1.1 Field of the Invention 

Mobile communication is currently one of the fastest-growing markets and is expected to 
continue to expand. Today the functionality of mobile communications is very limited: 
only speech or data can be handled. It is expected that image information, especially 
real-time video information, will greatly add to the value of mobile communication. Low 
cost mobile video transmission is in high demand for many practical applications, e.g. 
mobile visual communication, live TV news reporting, mobile surveillance, telemedicine, virtual 
reality, computer games, personal travel assistance, underwater visual communication, space 
communication, etc. However, there are some problems with including live video 
information in mobile communications. Unlike speech information, video 
information normally needs greater bandwidth and processing performance. In contrast, the 
mobile terminals of today and tomorrow suffer from certain limitations, e.g. 

1 mobile terminals usually have limited power; typical transmission output power 
levels are 10 microwatts - 1 watt. 

2 the terminals have limited capability to wirelessly transmit data, 1 kb/s - 10 kb/s. 

Based on these facts, real-time mobile video transmission can only be achieved when a highly 
efficient compression algorithm with very low implementation complexity can be 
implemented. 

1.2 Description of the Prior Art 

To compress motion pictures, a simple solution is to compress pictures on a frame by frame 
basis, e.g. by means of the JPEG algorithm. This results in a rather high bitrate, although 
at low complexity. To achieve high compression efficiency, we need to employ advanced 
video compression algorithms. Up to now, different types of video compression algorithms 
have been developed. Typical examples include H.263-type block-based, 3D model-based, 
and segmentation-based coding algorithms. Although based on different coding principles, 
these algorithms adopt a similar coding structure, namely the closed-loop coding structure. 
This unified coding structure is shown in Fig. 1. In this structure, the important blocks are 
image analysis, image synthesis, spatial encoder/decoder and modeling. Video 
compression algorithms differ in the image analysis and synthesis parts. Amongst 
the blocks, the most computationally demanding one is image analysis. The following are 
typical image analysis tasks performed in different algorithms: 

H.263 coding algorithm: The main function of the image analysis part in the 
H.263 algorithm is to perform motion estimation. For motion estimation, 
a block matching scheme is commonly used. Unfortunately, motion estimation 
is very time- and power-consuming although the employed motion model is 
very simple. For example, if a simple correlation measure is used, the sum of 
the absolute values of differences (SAD) is computed. For a full search over a 
+-7 displacement range at QCIF (176x144 pixels) resolution and 10 frames/s, 
the SAD computation would have to be performed over 57 million times 
each second. If a larger search range or a finer resolution of motion 
vectors is needed, the computation increases greatly. 
Model-based coding algorithm: The task of image analysis here is to extract 
animation parameters based on models. The models can be given a priori or 
be built during the coding process. Since the animation parameters are 
related to three-dimensional information, e.g. 3D motion and shape, while the 
observable is 2D image information, a good strategy is needed. A powerful 
tool based on the analysis by synthesis principle (ABS) is shown in Fig. 2. 
Obviously, this is a second closed loop appearing in the encoder. The 
purpose of using this loop is to aid or verify image analysis. This is done by 
means of the image synthesis block, by comparing with the reference image. 
Therefore, image analysis is rather complex. In contrast, 
image synthesis is much simpler. In addition, at the beginning of the session, 
the analysis part needs to do "initial works", such as object location or 
identification, feature point localization, the fitting of a generic model to the 
specific object appearing in the scene, and initial pose estimation. This type 
of work gives rise to a very heavy computational load. 
Segmentation-based coding algorithm: The task of image analysis here is to 
segment images into objects according to an assumed definition of objects, to 
track the objects and to estimate their motion or shape parameters. The 
commonly used models are 2D planar objects with affine motion and flexible 3D 
objects with 3D motion. Obviously, the estimation of the motion and shape 
parameters associated with these models needs sophisticated operations. In 
addition, another heavy task carried out by the image analysis is to maintain, 
update and choose the models. 
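
The figure of over 57 million quoted for H.263 full search can be reproduced with a short calculation. The sketch below is only a plausibility check. It assumes 16x16 luminance blocks and counts one absolute-difference operation per pixel per candidate displacement; the block size and the function name `sad_ops_per_second` are illustrative assumptions, not taken from the text.

```python
def sad_ops_per_second(width, height, frames_per_s, search_range, block=16):
    """Count per-pixel absolute-difference operations for full-search block matching.

    Assumes square blocks of `block` pixels (16 is an assumption, not stated in
    the text) and a search window of +-search_range pixels in each direction.
    """
    blocks_per_frame = (width // block) * (height // block)
    candidates_per_block = (2 * search_range + 1) ** 2
    pixels_per_block = block * block
    return blocks_per_frame * candidates_per_block * pixels_per_block * frames_per_s


# QCIF (176x144) at 10 frames/s with a +-7 search range
qcif_ops = sad_ops_per_second(176, 144, 10, 7)
print(qcif_ops)  # 57024000
```

Under these assumptions the count is 99 blocks x 225 candidates x 256 pixels x 10 frames/s, which matches the order of magnitude stated above.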

In summary, three important observations about the image analysis part in the advanced 
video compression algorithms are that 

(1) Due to the involvement of image analysis, the complexity of the encoder is 
much higher than that of the decoder. 

(2) The computational load caused by image analysis at the beginning is usually 
heavier than that during the coding process. This requires that the encoder has 
a strong peak computational ability. 

(3) The decoder must operate exactly as the encoder does. That is, the decoder 
is passively controlled by the encoder. 

These points make advanced video algorithms difficult to implement in low power terminals 
to achieve live video communication. 

2 Summary of the invention 

To achieve video communication, very low power terminals should employ video 
compression algorithms with the following features: 

(1) low complexity of the encoder and decoder 

(2) high compression efficiency 

(3) the operation of the encoder should be remotely controlled. 

To satisfy these requirements, it is not necessary to reinvent the wheel. This invention 
makes it possible to use current or future video compression algorithms to achieve video 
communication with low power terminals. The key idea of this invention is to move the 
image analysis part to the receiver or an intermediate point in the network. This is 
illustrated in Fig. 3. In this invention, we distinguish two different concepts, 
transmitter/receiver and encoder/decoder, which are mixed together in conventional video 
communication systems. In this invention, the encoder does not necessarily sit in the 
transmitter. Instead, a main part of the encoder remains in the transmitter while the 
computation-consuming part, image analysis, is put in the receiver. In this way, image 
analysis is performed in the receiver instead of the transmitter. The generated animation and 
model data are then sent back to the transmitter to run normal encoding processing. 
Therefore, the encoder is still virtually a complete system but is physically distributed 
across both the transmitter and receiver. Obviously, the implicit conditions for such 
communication are that the receiver has enough power to perform complex image analysis 
and that the latency between the receiver and transmitter is low. In fact, the second 
condition, low latency, is not necessary in our invention if model-based or object-oriented 
coding schemes are used. The key is the first condition, which is easily satisfied in the 
applications mentioned in section 1. For example, a high performance computer or server 
can be used to communicate with these low power terminals. In principle, this invention 
makes supercomputer performance available at a very low cost. 

Furthermore, this invention is also supported from the communication point of view. According to 
Shannon information theory, the channel capacity is determined by 

$$C = B \log_2\left(1 + \frac{S}{N}\right) \qquad (1)$$

where $B$ is the channel bandwidth and $S/N$ is the signal-to-noise power ratio. An increase 
in the signal power increases the channel capacity. Since sufficient power is available at the 
receiver side, it is no problem to transmit model and animation data 
to the transmitter. 
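
The effect of transmit power on capacity can be illustrated numerically. The following is a minimal sketch of the Shannon formula for an additive white Gaussian noise channel; the bandwidth and power values are hypothetical and chosen only so that the results are round numbers, and `channel_capacity` is an illustrative name.

```python
import math


def channel_capacity(bandwidth_hz, signal_power, noise_power):
    """Shannon capacity in bit/s for an additive white Gaussian noise channel."""
    return bandwidth_hz * math.log2(1.0 + signal_power / noise_power)


# Hypothetical numbers: same bandwidth and noise, two different transmit powers.
weak = channel_capacity(10_000.0, signal_power=1.0, noise_power=1.0)
strong = channel_capacity(10_000.0, signal_power=15.0, noise_power=1.0)
print(weak, strong)
```

With these values the capacity grows from about 10 kb/s to about 40 kb/s, which is the asymmetry the scheme relies on: the high power side can afford the link carrying model and animation data.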

In the case that the transmitter has certain computational ability and the latency is low, a 
second configuration scheme can be employed, which is illustrated in Fig. 4. In this scheme 
two analysis blocks are employed. These two analysis blocks are either hierarchical, 
e.g. the image analysis part in the receiver performs rough estimation of animation and 
model data while the one in the transmitter refines the results obtained from the receiver, 
or separate, e.g. the analysis in the transmitter is in charge of tracking while the one 
in the receiver is in charge of initial estimation. 

3 Description of the preferred embodiments 

In our invention, the receiver will take part in the coding processing. The image analysis 
is performed in the receiver. In comparison with the original scheme, a significantly 
different part is that the input to the image analysis block is no longer the current frame but 
the previous frame. Therefore, a key issue for image analysis is its ability to predict the 
parameters which are normally obtained from the current frame. Assume that the previously decoded 
frames $I^*_{t-1}$ and $I^*_{t-2}$ are available at the receiver side and the frame $I^*_{t-1}$ is also available 
at the transmitter side. Now the task is to transmit the current frame $I_t$. 

Image analysis is performed in the receiving side by using the previously decoded frames 
$I^*_{t-1}$ and $I^*_{t-2}$, based on the model $V^*_{t-1}$: 

$$[\tilde{A}_t, \tilde{M}_t]^T = \mathrm{analysis}(I^*_{t-1}, I^*_{t-2}, V^*_{t-1}) \qquad (2)$$

where $\tilde{A}_t$ are the animation data and $\tilde{M}_t$ are the model data which will be used to update 
the employed model, the tilde marking values predicted from previously decoded frames. The 
obtained animation data and model data are compressed and then 
transmitted back to the transmitter. 

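At both the transmitter and receiver the employed model is updated by the model data 

$$\tilde{V}_t = \mathrm{model\_update}(V^*_{t-1}, \tilde{M}_t) \qquad (3)$$

Here we distinguish two cases, writing the values predicted at the receiver with a tilde: 

case I: no refinement 

$$A_t = \tilde{A}_t, \qquad M_t = \tilde{M}_t, \qquad V_t = \tilde{V}_t \qquad (4)$$

case II: refinement 

A new image analysis taking the current frame as input is performed to refine the results 
estimated from previously decoded frames 

$$[A_t, M_t]^T = \mathrm{analysis\_refine}(I_t, \tilde{A}_t, \tilde{M}_t, \tilde{V}_t) \qquad (5)$$

and the employed model is also refined 

$$V_t = \mathrm{model\_refine}(\tilde{V}_t, M_t) \qquad (6)$$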

At the transmitter, the prediction of the current frame can be rendered by an image 
synthesis operation based on the updated model $V_t$: 

$$I'_t = \mathrm{render}(A_t, V_t) \qquad (7)$$

The texture residual signal is 

$$\delta I_t = I_t - I'_t \qquad (8)$$

The residual information will be compressed by applying spatial coding methods, e.g. DCT, 
to the residual signal: 

$$R_t = \mathrm{DCT}(\delta I_t) \qquad (9)$$

Note that in case II the animation and model residual signals $\delta A_t$ and $\delta M_t$ also need 
to be compressed and transmitted to the receiver. These two residual signals are the differences 
between the refined data and the data $\tilde{A}_t$, $\tilde{M}_t$ predicted at the receiver: 

$$\delta A_t = A_t - \tilde{A}_t \qquad (10)$$

and 

$$\delta M_t = M_t - \tilde{M}_t \qquad (11)$$
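
The message flow of case I (no refinement) can be sketched in a few lines. This is a toy illustration under loudly stated assumptions, not the patented implementation: frames are 1-D lists, the animation data is a single integer shift, borders wrap around, the residual is sent losslessly (no DCT or quantization), and all function names are hypothetical.

```python
def render(animation, prev_decoded):
    """Image synthesis stand-in: shift the previous decoded frame by `animation` samples."""
    n = len(prev_decoded)
    return [prev_decoded[(i - animation) % n] for i in range(n)]


def receiver_analysis(prev1, prev2, max_shift=3):
    """Stand-in for the analysis step (2): find the shift from frame t-2 to t-1
    and assume the same shift continues into frame t."""
    def mismatch(shift):
        return sum((a - b) ** 2 for a, b in zip(render(shift, prev2), prev1))
    return min(range(-max_shift, max_shift + 1), key=mismatch)


def transmitter_encode(current, animation, prev_decoded):
    """Steps (7)-(8): render the prediction, return the texture residual."""
    predicted = render(animation, prev_decoded)
    return [c - p for c, p in zip(current, predicted)]


def receiver_decode(residual, animation, prev_decoded):
    """Receiver side: same rendering, then add the received residual."""
    predicted = render(animation, prev_decoded)
    return [p + r for p, r in zip(predicted, residual)]


# Toy sequence: a ramp moving 2 samples per frame, plus one new detail in frame t.
frame_t2 = list(range(12))
frame_t1 = render(2, frame_t2)
frame_t = render(2, frame_t1)
frame_t[5] += 40

animation = receiver_analysis(frame_t1, frame_t2)            # computed at the receiver
residual = transmitter_encode(frame_t, animation, frame_t1)  # computed at the transmitter
decoded = receiver_decode(residual, animation, frame_t1)     # back at the receiver
```

The transmitter performs only a rendering step and a subtraction; the search inside `receiver_analysis` runs entirely on the receiver side. In this toy run the residual is zero everywhere except at the one sample that the motion model could not predict.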

4 Specified techniques 

4.1 Motion prediction technique for two-way mobile communication 

In two-way mobile communication, one of the main problems is that the available channel 
capacity is very limited, e.g. 9.6 kb/s for the GSM standard. To transmit video through the 
mobile channel, advanced video coding algorithms must be employed. H.263, a video 
coding algorithm standardized by the ITU [1], is very suitable for this purpose. This standard 
coding algorithm can also be described by the scheme shown in Fig. 1. Since H.263 is a 
block-based video compression algorithm, from table 1 we see that image analysis 
corresponds to block-based motion estimation and image synthesis to block-based motion 
compensation. There is no information about modelling. 



The main problem with this kind of configuration is that the limited power available in the 
mobile terminals prohibits complicated signal processing operations, e.g. motion estimation. 
In addition, both the transmitter and receiver have low power. Fortunately, no 
direct communication is necessary between these two low power terminals. The two-way 
communication is carried out through base stations where sufficient power is available. 

A practical solution is to move the motion estimation part from the transmitter to the base 
station. That is, the encoder virtually sits across the transmitting terminal and the base 
station. Now the task of image analysis is to estimate animation data in the base station. 
Since a block-based algorithm is employed, the animation data contains motion information 
only. 

Assume that the previously decoded frames $I^*_{t-1}$ and $I^*_{t-2}$ are available at the base station and the 
frame $I^*_{t-1}$ is also available at the transmitter and the receiver. To make a prediction of the 
current frame, motion vectors for the current frame must be provided. This is done in the 
base station, where motion prediction has to be made: 

$$\tilde{A}_t = \mathrm{analysis}(I^*_{t-1}, I^*_{t-2}) \qquad (12)$$

The obtained motion information $\tilde{A}_t$ is then transmitted to both the transmitter and the 
receiver. 

With the obtained motion vectors $\tilde{A}_t$, the motion vectors to be used can be obtained by 
either $A_t = \tilde{A}_t$ or $A_t = \mathrm{analysis\_refine}(I_t, I^*_{t-1}, \tilde{A}_t)$, depending on which scheme is used. Thus, 
a motion compensated prediction of the current frame is given by 

$$I'_t = \mathrm{render}(A_t, I^*_{t-1}) \qquad (13)$$
Obviously, the key to this technique lies in the performance of motion prediction (12). 
Now, we present a new technique to perform motion prediction. The core part of this 
invention is the employment of a new search strategy which allows us to make use of 
existing search methods, e.g. the block-matching algorithm, to achieve motion prediction. 

Assume that a block-matching scheme is employed. To predict the current frame $I_t$ using the 
previous frame $I^*_{t-1}$, the current frame must first be segmented into rectangular blocks, and then one 
displacement vector $(u^*, v^*)$ is estimated per block through 

$$\min_{(u,v)} \sum_{R} \left| I_t(n_1,n_2) - I^*_{t-1}(n_1-u, n_2-v) \right|^2 \qquad (14)$$

where $R$ is a rectangular block centered on $(n_1, n_2)$. 

Since $I_t$ is not available in the base station, the motion vector $(u^*, v^*)$ cannot be recovered 
at time $t$ from the constraint (14). To achieve motion prediction, the interframe motion is 
assumed to follow a physical motion model. Under this assumption, we have 

$$I_t(n_1,n_2) \approx I^*_{t-1}(n_1-u^*, n_2-v^*) \approx I^*_{t-2}(n_1-au^*, n_2-bv^*) \qquad (15)$$

where $a$ and $b$ are constants which are specified by the assumed motion model. When 
uniform motion is assumed, $a=2$ and $b=2$. 

Now the constraint (14) can be rewritten as 

$$\min_{(u,v)} \sum_{R} \left| I^*_{t-2}(n_1-au, n_2-bv) - I^*_{t-1}(n_1-u, n_2-v) \right|^2 \qquad (16)$$

The motion vector $(u^*, v^*)$ can be estimated from the new constraint. 

The motion vectors $(u^*, v^*)$ are the animation data specified in equation (12). If no 
refinement is performed, then $u=u^*$ and $v=v^*$. Now the motion compensated prediction 
of the current frame can be given as 



$$I'_t(n_1,n_2) = \mathrm{render}(A_t, I^*_{t-1}(n_1,n_2)) = I^*_{t-1}(n_1-u, n_2-v) \qquad (17)$$



If a refinement operation is performed, a similar result can be obtained. 
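
The search strategy of this section, which matches the two previous frames against each other instead of matching against the unavailable current frame, can be sketched as follows. This is a minimal illustration, assuming uniform motion (a = b = 2), an exhaustive search, a sum of squared differences as the mismatch measure, and wrap-around borders; the helper names `shifted` and `predict_motion` are hypothetical.

```python
def shifted(frame, u, v):
    """Return g with g[n1][n2] = frame[n1 - u][n2 - v], using wrap-around borders."""
    h, w = len(frame), len(frame[0])
    return [[frame[(n1 - u) % h][(n2 - v) % w] for n2 in range(w)]
            for n1 in range(h)]


def predict_motion(prev1, prev2, block, search=3, a=2, b=2):
    """Search (u, v) minimising the mismatch between frame t-2 displaced by
    (a*u, b*v) and frame t-1 displaced by (u, v) over one block.

    `block` is (top, left, size). The current frame is never used.
    """
    top, left, size = block
    best = None
    for u in range(-search, search + 1):
        for v in range(-search, search + 1):
            far = shifted(prev2, a * u, b * v)
            near = shifted(prev1, u, v)
            err = sum((far[n1][n2] - near[n1][n2]) ** 2
                      for n1 in range(top, top + size)
                      for n2 in range(left, left + size))
            if best is None or err < best[0]:
                best = (err, u, v)
    return best[1], best[2]


# Toy data: every pixel has a unique value; content moves by (1, 2) per frame.
size = 16
frame_t2 = [[n1 * size + n2 for n2 in range(size)] for n1 in range(size)]
frame_t1 = shifted(frame_t2, 1, 2)
frame_t = shifted(frame_t1, 1, 2)      # known only to the transmitter

u, v = predict_motion(frame_t1, frame_t2, block=(4, 4, 8))
prediction = shifted(frame_t1, u, v)   # motion compensated prediction of frame t
```

Because the toy motion is exactly uniform, the vector found from the two past frames also predicts the current frame exactly. With real video the prediction is only approximate, and the texture residual carries the difference.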



4.2 3D Motion Prediction Technique for Object-Oriented Video Coding Applications 
In object-oriented video coding schemes, an important image analysis task is to estimate the 
3D motion of the objects appearing in the scene. For example, in videophone and video 
conference scenes, a typical object is the human head. The task there is to extract two types 
of information: face definition parameters and face animation parameters. They correspond 
to the model data and animation data defined in this invention. Advanced computer vision 
techniques are usually used to extract these data. As is known, these algorithms are 
computationally intensive, which usually cannot be afforded by low power terminals. Our 
invention can handle this problem by moving the image analysis part to the receiver. The 
core of this technique is to employ Kalman filtering to predict the 3D motion of objects 
appearing in the current frame. 

Assume the 3D motion of objects can be modeled as a dynamic process 

$$A_{t+1} = f(A_t) + \xi_t \qquad (18)$$

where the function $f$ models the dynamic evolution of the state vector $A_t$ at time $t$. The 
measurement process is 

$$Y_t = h(A_t) + \eta_t \qquad (19)$$

where the sensor observations $Y_t$ are a function $h$ of the state vector and time. Both $\xi_t$ and 
$\eta_t$ are white noise processes having known spectral density matrices. 

In the specific applications, the state vector $A_t$ consists of the motion parameters to be 
estimated and the observation vector $Y_t$ contains the available measurements, such as 
pixel-based or feature-based measurements. The function $f$ may be a discrete-time Newtonian 
physical model of 3D motion. 

Using Kalman filtering, we can obtain the optimal linear estimate $\hat{A}_t$ of the state vector $A_t$ 

$$\hat{A}_t = A^*_t + K_t\,(Y_t - h(A^*_t)) \qquad (20)$$

where $A^*_t$ is the prediction of the state vector $A_t$ and $K_t$ is the Kalman gain. 

The prediction of the state vector $A_{t+1}$ at the next time step is 

$$A^*_{t+1} = f(\hat{A}_t) \qquad (21)$$

if the measurement $Y_t$ at time $t$ is available. 

Since the image analysis part is moved to the receiver, what is available there is the 
measurement $Y_{t-1}$. Therefore, we can only obtain the optimal linear estimate $\hat{A}_{t-1}$ of the state 
vector $A_{t-1}$, 

$$\hat{A}_{t-1} = A^*_{t-1} + K_{t-1}\,(Y_{t-1} - h(A^*_{t-1})) \qquad (22)$$

With it, the prediction of the motion parameters of the current frame is obtained: 

$$A^*_t = f(\hat{A}_{t-1}) \qquad (23)$$



The predicted motion parameters $A^*_t$ are sent to the transmitter. The transmitter synthesizes 
a prediction image $I'_t$ of the current frame, then the residual signal between the prediction 
and the current frame is compressed and transmitted to the receiver. The receiver reconstructs 
the current frame with the received residual signal. Then the current measurement $Y_t$ can 
be extracted. With $Y_t$, the prediction of the next frame can be derived by using (20) and (21). 
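
The receiver-side loop described above (correct with the delayed measurement, then predict one step ahead) can be sketched with a very small Kalman filter. The example is a sketch under simplifying assumptions that are not in the text: the state is a 1-D position and velocity rather than full 3D motion parameters, f is a constant-velocity model, h observes the position only, and the noise variances `r` and `q` are hypothetical values. All function names are illustrative.

```python
def kalman_update(x_pred, P_pred, y, r=1.0):
    """Correct a predicted state [position, velocity] with a position measurement y.

    Measurement model h(A) = position (an assumption); r is the measurement
    noise variance.
    """
    s = P_pred[0][0] + r                          # innovation variance
    k0, k1 = P_pred[0][0] / s, P_pred[1][0] / s   # Kalman gain
    innovation = y - x_pred[0]
    x_est = [x_pred[0] + k0 * innovation, x_pred[1] + k1 * innovation]
    P_est = [[(1 - k0) * P_pred[0][0], (1 - k0) * P_pred[0][1]],
             [P_pred[1][0] - k1 * P_pred[0][0], P_pred[1][1] - k1 * P_pred[0][1]]]
    return x_est, P_est


def kalman_predict(x_est, P_est, q=0.0):
    """Propagate with the constant-velocity model f(A) = [p + v, v].

    q is the process noise variance added to each diagonal entry.
    """
    (a, b), (c, d) = P_est
    x_pred = [x_est[0] + x_est[1], x_est[1]]
    P_pred = [[a + b + c + d + q, b + d],
              [c + d, d + q]]
    return x_pred, P_pred


def receiver_step(x_pred, P_pred, y_prev):
    """Use the delayed measurement Y_{t-1} to produce the prediction A*_t."""
    x_est, P_est = kalman_update(x_pred, P_pred, y_prev)   # as in (22)
    return kalman_predict(x_est, P_est)                    # as in (23)


# Toy run: noiseless positions 3 + 2*t for t = 0..19, vague initial prediction.
x_pred, P_pred = [0.0, 0.0], [[1000.0, 0.0], [0.0, 1000.0]]
for t in range(20):
    x_pred, P_pred = receiver_step(x_pred, P_pred, y_prev=3.0 + 2.0 * t)
# x_pred is now the prediction for t = 20, close to position 43 and velocity 2
```

In each iteration only the prediction would be sent to the transmitter; the measurement for the current frame becomes available at the receiver only after the residual has been received and the frame reconstructed.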
More specifically, this technique can be used for three cases: 



Case I: the terminal has weak computational ability and the connection from 
the receiver to the terminal is of low delay 

The predicted motion parameters are used in the terminal and only the texture 
residual signal is transmitted to the receiver. 



Case II: the terminal has certain computational ability and the connection 
from the receiver to the terminal is of low delay 

The predicted motion parameters will be refined by using the current frame. 
The refined motion parameters are used to predict the current frame. The 
texture residual signal is transmitted to the receiver. Additional residual 
signals between the refined motion parameters and the predicted ones are also 
transmitted to the receiver. 

Case III: the connection from the receiver to the terminal is of long (or 
irregular) delay 

No information is sent from the transmitter to the receiver for a while. The 
predicted motion parameters are used to drive animation at the receiver. 

The enclosed drawings show: 

Fig. 1: A unified video compression structure. 

Fig. 2: Image analysis based on the analysis by synthesis principle (ABS) 
Fig. 3: New implementation scheme where image analysis is performed at the receiver 
Fig. 4: Alternative implementation where image analysis is performed at both the 
transmitter and the receiver. 
CLAIMS 

1) Method for the transferring of moving pictures where a predicted picture frame is 
obtained through motion compensation techniques and then compared with the actual picture 
frame, a deviation signal is generated corresponding to the difference between the actual 
and the predicted frames and motion information is for compensation sent to the receiving 
station, characterized in the motion estimation being carried out outside the picture sending 
unit and in that the predicted motion information generated by motion estimation is not only 
sent to the receiving unit but also to the sending unit in which the predicted picture frame 
is compared with the real picture frame and the resulting deviation signal is sent to the 
motion compensating prediction unit and the receiving station or terminal. 

2) Method according to claim 1, characterized in that the motion prediction is carried out 
outside the sending unit, e.g. a base station in the mobile phone case or a central station in 
a wired phone case and that communication between the motion estimator and the sending 
unit and the receiving unit respectively is via ordinary wire or cellular telephone networks, 
or other media. 

3) Method according to claim 1 and 2, characterized in duplex communication with the 
motion estimation for both ends being arranged centrally or apart from the picture sending 
and receiving terminals. 

4) Method according to any of the preceding claims, characterized in a regular and 
additional or initial updating of the entire picture or parts thereof. 

5) Method according to any of the preceding claims, characterized in that the information 
flow from the picture sending unit is smaller than or equal to that from the central base 
station to the picture sending unit. 

6) Communication network for the transfer of moving pictures in accordance with claim 1, 
characterized in that equipment for motion estimation is arranged in one or several base 
stations. 

7) Method for the transferring of moving pictures where a predicted picture frame is 
obtained through image analysis and then compared with the actual picture frame, a 
deviation signal is generated corresponding to the difference between the actual and the 
predicted frames and information is for compensation sent to the receiving station, 
characterized in the image analysis being carried out outside the picture sending unit and 
in that the predicted information generated by image analysis is not only sent to the 
receiving unit but also to the sending unit in which the predicted picture frame is compared 
with the real picture frame and the resulting deviation signal is sent to the compensating 
prediction unit and the receiving station or terminal. 

8) Method according to claim 7, characterized in that the motion prediction is carried out 
outside the sending unit, e.g. a base station in the mobile phone case or a central station in 
a wired phone case and that communication between the motion estimator and the sending 
unit and the receiving unit respectively is via ordinary wire or cellular telephone networks, 
or other media. 



WO 99/02003    PCT/SE98/00030 

9) Method according to claim 7 or 8, characterized in duplex communication with the 
image analysis for both ends being arranged centrally or apart from the picture sending and 
receiving terminals. 

10) Method according to any of the preceding claims 7 to 9, characterized in a regular and 
additional or initial updating of the entire picture or parts thereof. 

11) Method according to any of the preceding claims 7 to 10, characterized in that the 
information flow from the picture sending unit is smaller than or equal to that from the 
central base station to the picture sending unit. 

12) Communication network for the transfer of moving pictures in accordance with claim 
7, characterized in that equipment for image analysis is arranged in one or several base 
stations. 

13) Method for the transferring of moving pictures where a predicted picture frame is 
obtained through image analysis and then compared with the actual picture frame, a 
deviation signal is generated corresponding to the difference between the actual and the 
predicted frames and information is for compensation sent to the receiving station, 
characterized in the image analysis being carried out outside the picture sending unit and 
in that the predicted information generated by image analysis is sent to the sending unit in 
which the predicted picture frame is compared with the real picture frame and the resulting 
deviation signal is sent to the compensating prediction unit. 

14) Method according to claim 13, characterized in the sending of the deviation signal also 
to a receiving station or terminal together with the predicted picture frame information. 

15) Method according to claim 14, characterized in that the image prediction is carried out 
outside the sending unit, e.g. a base station in the mobile phone case or a central station in 
a wired phone case and that communication between the prediction estimator and the 
sending unit and the receiving unit respectively is via ordinary wire or cellular telephone 
networks, or other media. 

16) Method according to claim 13 or 14, characterized in duplex communication with the 
image analysis for both ends being arranged centrally or apart from the picture sending and 
receiving terminals. 

17) Method according to any of the preceding claims 13 to 16, characterized in a regular 
and additional or initial updating of the entire picture or parts thereof. 

18) Method according to any of the preceding claims 13 to 17, characterized in that the 
information flow from the picture sending unit is smaller than or equal to that from the 
central base station to the picture sending unit. 

19) Method for video communication, characterized in the image analysis being carried 
out in the receiver or an intermediate point in a network. 

20) Method according to any of the previous claims, characterized in the image analysis 
being motion prediction. 

21) Communication network for the transfer of moving pictures in accordance with any of 
the previous claims 13 to 20, characterized in that equipment for image analysis is 
arranged in one or several base stations. 